A benchmark of computational pipelines for single-cell histone modification data

Background Single-cell histone post translational modification (scHPTM) assays such as scCUT&Tag or scChIP-seq allow single-cell mapping of diverse epigenomic landscapes within complex tissues and are likely to unlock our understanding of various mechanisms involved in development or diseases. Running scHTPM experiments and analyzing the data produced remains challenging since few consensus guidelines currently exist regarding good practices for experimental design and data analysis pipelines. Results We perform a computational benchmark to assess the impact of experimental parameters and data analysis pipelines on the ability of the cell representation to recapitulate known biological similarities. We run more than ten thousand experiments to systematically study the impact of coverage and number of cells, of the count matrix construction method, of feature selection and normalization, and of the dimension reduction algorithm used. This allows us to identify key experimental parameters and computational choices to obtain a good representation of single-cell HPTM data. We show in particular that the count matrix construction step has a strong influence on the quality of the representation and that using fixed-size bin counts outperforms annotation-based binning. Dimension reduction methods based on latent semantic indexing outperform others, and feature selection is detrimental, while keeping only high-quality cells has little influence on the final representation as long as enough cells are analyzed. Conclusions This benchmark provides a comprehensive study on how experimental parameters and computational choices affect the representation of single-cell HPTM data. We propose a series of recommendations regarding matrix construction, feature and cell selection, and dimensionality reduction algorithms. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-023-02981-2.

We could see in Fig. 3 that Signac's best performances are achieved for a smaller bin size than the other methods, it is especially surprising because it is supposed to implement the same algorithm as ChromSCape_LSI. In this section we investigate the reason behind this difference.
A close look at the implementation of Signac shows that it does not implement the standard LSI algorithm, but instead a whitened version of it; the principal components (PC) are not weighted by their explained variance. Furthermore the default dimension used is 50 instead of 10 for ChromSCape_LSI. Both methods also remove the first PC, as it is assumed to be mostly driven by the coverage per cell. In order to understand the cause of the shift in optimal bin size we ran Signac with dimension 10, and ran ChromSCape_LSI at dimension 50 and modified it to use whitening.
By comparing the three conditions with a dimension of 10, we can see that the difference in performances between Signac and ChromSCape_LSI in the previous sections can mostly be explained by the number of PCs, indeed except in the case of H3K9me3 the performances of the two are almost identical as can be seen in Fig. S1.
However we can see that in higher dimensions, 50 PCs, the performances are very different. While the top performances are comparable, with a slight advantage for ChromSCape_LSI, Signac has a much tighter range of bin sizes with good performances. This can be explained by the fact that in large dimensions, the later PCs explain less variance than the early ones, and not weighing appropriately induces noise. We can see that when the bin size is small, the explained variance per PC is less concentrated in the first PCs than in the later ones, this is shown in Fig. S2. This explains why Signac has a tighter range of good bin sizes, as its whitening makes the later PCs as important as the first ones.
We thus recommend not whitening when using LSI, as it makes the method less robust to the choice of bin size.
Trade-off between coverage and cell number i n e q 0 _ q 5 0 q 1 0 _ q 6 0 q 2 0 _ q 7 0 q 3 0 _ q 8 0 q 4 0 _ q 9 0 q 5 0 _ q 1 0 0 H3K9me3 a l l b a s e l i n e q 0 _ q 5 0 q 1 0 _ q 6 0 q 2 0 _ q 7 0 q 3 0 _ q 8 0 q 4 0 _ q 9 0 q 5 0 _ q 1 0 0 Coverage condition H3K27ac a l l b a s e l i n e q 0 _ q 5 0 q 1 0 _ q 6 0 q 2 0 _ q 7 0 q 3 0 _ q 8 0 q 4 0 _ q 9 0 q 5 0 _ q 1 0 0 H3K4me1 a l l b a s e l i n e q 0 _ q 5 0 q 1 0 _ q 6 0 q 2 0 _ q 7 0 q 3 0 _ q 8 0 q 4 0 _ q 9 0 q 5 0 _ q 1 0 0 H3K4me3 Figure S3: Study of the effect of cell coverage on the performances of the representations. The all condition contains all the cell as a reference, the baseline condition contains only 50% of the cells uniformly sampled at random, the other 6 conditions contain 50% of the cells, but are sorted by coverage. We order the cells by how much reads they contain and take all the cells from the bottom n% up to n + 50% in order to have the same amount of cells in all conditions. A Best performance across matrix construction, measured for all of the 9 methods, 5 marks of the mouse brain dataset and 8 coverage conditions. B. Performances of Signac and ChromSCape_LSI on H3K4me1 and H3K27me3 as a function of matrix construction. C. Average best performance of the 9 methods across all marks, for the lowest covered cells, random cells, and highest covered cells. D. UMAP projection of H3K4me1 at different coverage qualities, using ChromSCape_LSI across at 30kbp, colored by the labels of [1] obtained from the scRNA-seq co-assays.
While experimentalists do not control the depth of sequencing per cell with the current technologies, we felt that studying how increasing the coverage would impact the quality of the representations to be of interest. Indeed we currently do not know whether there would be a benefit from such an increase, and if there is one, how strong its effect would be. In order to evaluate the effect of coverage, we select 50% of the cells, but constrain them to have similar coverage. For example in the q0_q50 condition we take the 50% cells with the lowest coverage per cell, and in q50_q100 we take the cells with the highest coverage. In a more general way, we sort the cells by coverage per cell, and select the cells whose coverage falls between the n-th percentile, and n + 50-th percentile. This allows us to have all conditions with the same amount of cells, and just study the effect of coverage. We also have a condition where we just sample half of the cells at random, including all coverage, in order to have a baseline to compare against. That protocol is summarized in Fig. S3. This approach of sampling the cells by coverage instead of downsampling the reads per cell has the advantage that it does not make any assumption on the data generation process. Indeed here all the observed cells are real cells, instead of cells that are modified with computational means.
First we can observe that the performances of all methods increase as we increase the coverage, which was expected. However unlike the number of cells in an experiment seen in Fig. 6, the positive effect of more reads per cells does not plateau and is almost a straight line in the case of H3K4me1 as we can see in Fig. S3. If we look at the differences in performances between the least and most covered cells, summarized by mark in Table S12 and by method in Table S13, we can see an increase of at least 35%. That increase even goes as far as 107% for H3K4me3 with ChromSCape_LSI. Looking at the difference between the baseline and the high coverage cells, that increase is still in the order of 15%. It is interesting to notice that this effect on performances is larger than the one obtained by an increase in the number of cells, furthermore as we can see in Fig. S3 this gain is not yet completely achieved in our protocol. That effect is specifically noticeable for H3K27me3. This also agrees with the results on the section studying the role of selecting cells by coverage, where we identified that the marks benefiting the most from this selection were H3K27me3, H3K27ac, and H3K4me1.
The effect of coverage is also consistent across methods, with Signac, ChromSCape_LSI an ChromSCape_PCA benefiting the most form this increase in coverage, above 60%. The first two already being the best performing methods allows to fully reap the benefits from a high coverage.      Table S6: Ratio between the best and worst performances across matrix construction on the raw data (no feature or cell selection applied) for each mark and method combination on the mouse brain dataset.    Table S12: Ratio of the performances between having high coverage (q50_100 condition) in the mouse brain dataset, and either low coverage or baseline coverage, averaged by mark. The "large increase" column is the increase in performance observed against q0_50 where we select the cells with the lowest coverage. The "baseline" column is the increase against no selection.  Table S13: Ratio of the performances between having high coverage (q50_100 condition) in the mouse brain dataset, and either low coverage or baseline coverage, averaged by method. The "large increase" column is the increase in performance observed against q0_50 where we select the cells with the lowest coverage. The "baseline" column is the increase against no selection.